import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
for filename in filenames:
print(os.path.join(dirname, filename))
/kaggle/input/new-york-city-buildings-energy-consumption-survey/Dockerfile /kaggle/input/new-york-city-buildings-energy-consumption-survey/LICENSE /kaggle/input/new-york-city-buildings-energy-consumption-survey/.gitignore /kaggle/input/new-york-city-buildings-energy-consumption-survey/README.md /kaggle/input/new-york-city-buildings-energy-consumption-survey/requirements.txt /kaggle/input/new-york-city-buildings-energy-consumption-survey/model/app.yaml /kaggle/input/new-york-city-buildings-energy-consumption-survey/model/README.md /kaggle/input/new-york-city-buildings-energy-consumption-survey/model/XGBModel/conda.yaml /kaggle/input/new-york-city-buildings-energy-consumption-survey/model/XGBModel/MLmodel /kaggle/input/new-york-city-buildings-energy-consumption-survey/model/XGBModel/model.xgb /kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/GreenBuildings3.ipynb /kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/GreenBuildings2.ipynb /kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/GreenBuildings1.ipynb /kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/utilities/GetNeighborhoodNames.py /kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/utilities/CleaningFunctions.py /kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/utilities/PlottingFunctions.py /kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/DockerRun.png /kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/LinearModel2.png /kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/building_type.png /kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/pushtogcr.png /kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/GCR.png /kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/Energy_Star.png /kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/AppEngine.png /kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/XGBoostServe1_P2.png /kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/Correlations.png /kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/DockerImageSize.png /kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/bigquery.png /kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/Compare.png /kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/Success.png /kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/XGBoostRun.png /kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/XGBoostGrid.png /kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/LinearModel1.png /kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/deploy.png /kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/iforest.png /kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/EmptyMLFlow.png /kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/XGBoostServe1.png /kaggle/input/new-york-city-buildings-energy-consumption-survey/data/nyc-zip-code-tabulation-areas-polygons.geojson /kaggle/input/new-york-city-buildings-energy-consumption-survey/data/nyc_benchmarking_disclosure_data_reported_in_2016.xlsx /kaggle/input/new-york-city-buildings-energy-consumption-survey/data/neighborhoods.pkl
Importing the Data set¶
df = pd.read_excel('/kaggle/input/new-york-city-buildings-energy-consumption-survey/data/nyc_benchmarking_disclosure_data_reported_in_2016.xlsx')
Preliminary Analysis¶
df.shape
(13223, 57)
df.head(5)
| Record Number | Order | NYC Borough, Block and Lot (BBL) | Co-reported BBL Status | BBLs Co-reported | Reported NYC Building Identification Numbers (BINs) | Property Name | Parent Property Id | Parent Property Name | Street Number | ... | Direct GHG Emissions (Metric Tons CO2e) | Indirect GHG Emissions (Metric Tons CO2e) | DOF Property Floor Area (ft²) | Property GFA - Self-reported (ft²) | Water Use (All Water Sources) (kgal) | Municipally Supplied Potable Water - Indoor Intensity (gal/ft²) | Release Date | DEP Provided Water Use (kgal) | Automatic Water Benchmarking Eligible | Reported Water Method | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4113711.0 | 1 | 1.003360e+09 | NaN | NaN | 1088884 | 16 Bialystoker Place | Not Applicable: Standalone Property | Not Applicable: Standalone Property | 15.0 | ... | 290.2 | 146.1 | 98052.0 | 98000.0 | NaN | NaN | 2016-04-13 11:08:48 | NaN | NaN | NaN |
| 1 | 4630135.0 | 2 | 1.010618e+09 | NaN | NaN | 1026728 | (7478) - 432 West 52nd Street Condominium | Not Applicable: Standalone Property | Not Applicable: Standalone Property | 432.0 | ... | 76.4 | 125.0 | 58851.0 | 57401.0 | 768.5 | 13.39 | 2016-04-26 21:52:05 | NaN | Not Eligible | NaN |
| 2 | 4626594.0 | 3 | 1.014270e+09 | NaN | NaN | 1043974 | (9250) - 250 East 73rd Street | Not Applicable: Standalone Property | Not Applicable: Standalone Property | 1393.0 | ... | 145.5 | 392.9 | 166432.0 | 166433.0 | 11901.5 | 71.51 | 2016-04-26 22:56:36 | 11897.0 | Eligible | Manual |
| 3 | 4626608.0 | 4 | 1.015180e+09 | NaN | NaN | 1048117 | (9141) - 141 East 89th Street | Not Applicable: Standalone Property | Not Applicable: Standalone Property | 1341.0 | ... | 304.6 | 394.7 | 114939.0 | 114939.0 | 7455.1 | 64.86 | 2016-04-26 22:38:03 | 7455.1 | Eligible | ABS |
| 4 | 5048763.0 | 5 | 1.003920e+09 | NaN | NaN | 1079685 | 605 East 9th ST | Not Applicable: Standalone Property | Not Applicable: Standalone Property | 350.0 | ... | 0.0 | 0.0 | 93420.0 | 111000.0 | NaN | NaN | 2016-07-28 17:28:14 | NaN | NaN | NaN |
5 rows × 57 columns
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 13223 entries, 0 to 13222 Data columns (total 57 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Record Number 11531 non-null float64 1 Order 13223 non-null int64 2 NYC Borough, Block and Lot (BBL) 13222 non-null float64 3 Co-reported BBL Status 615 non-null object 4 BBLs Co-reported 615 non-null object 5 Reported NYC Building Identification Numbers (BINs) 11978 non-null object 6 Property Name 11531 non-null object 7 Parent Property Id 11531 non-null object 8 Parent Property Name 11531 non-null object 9 Street Number 10380 non-null float64 10 Street Name 13223 non-null object 11 Zip Code 12627 non-null float64 12 Borough 13223 non-null object 13 DOF Benchmarking Submission Status 13223 non-null object 14 Primary Property Type - Self Selected 11978 non-null object 15 List of All Property Use Types at Property 11978 non-null object 16 Largest Property Use Type 11978 non-null object 17 Largest Property Use Type - Gross Floor Area (ft²) 11527 non-null float64 18 2nd Largest Property Use Type 11978 non-null object 19 2nd Largest Property Use - Gross Floor Area (ft²) 11978 non-null object 20 3rd Largest Property Use Type 11978 non-null object 21 3rd Largest Property Use Type - Gross Floor Area (ft²) 11978 non-null object 22 Year Built 11978 non-null object 23 Number of Buildings - Self-reported 11526 non-null float64 24 Occupancy 11531 non-null float64 25 Metered Areas (Energy) 11978 non-null object 26 Metered Areas (Water) 11978 non-null object 27 ENERGY STAR Score 9535 non-null float64 28 Site EUI (kBtu/ft²) 11439 non-null float64 29 Weather Normalized Site EUI (kBtu/ft²) 9848 non-null float64 30 Weather Normalized Site Electricity Intensity (kWh/ft²) 11095 non-null float64 31 Weather Normalized Site Natural Gas Intensity (therms/ft²) 9463 non-null float64 32 Source EUI (kBtu/ft²) 11439 non-null float64 33 Weather Normalized Source EUI (kBtu/ft²) 9848 non-null float64 34 Fuel Oil #1 Use (kBtu) 11978 non-null object 35 Fuel Oil #2 Use (kBtu) 11978 non-null object 36 Fuel Oil #4 Use (kBtu) 11978 non-null object 37 Fuel Oil #5 & 6 Use (kBtu) 11978 non-null object 38 Diesel #2 Use (kBtu) 11978 non-null object 39 District Steam Use (kBtu) 11978 non-null object 40 District Hot Water Use (kBtu) 11978 non-null object 41 District Chilled Water Use (kBtu) 11978 non-null object 42 Natural Gas Use (kBtu) 10087 non-null float64 43 Weather Normalized Site Natural Gas Use (therms) 9465 non-null float64 44 Electricity Use - Grid Purchase (kBtu) 11425 non-null float64 45 Weather Normalized Site Electricity (kWh) 11097 non-null float64 46 Total GHG Emissions (Metric Tons CO2e) 11478 non-null float64 47 Direct GHG Emissions (Metric Tons CO2e) 11471 non-null float64 48 Indirect GHG Emissions (Metric Tons CO2e) 11495 non-null float64 49 DOF Property Floor Area (ft²) 12627 non-null float64 50 Property GFA - Self-reported (ft²) 11531 non-null float64 51 Water Use (All Water Sources) (kgal) 7265 non-null float64 52 Municipally Supplied Potable Water - Indoor Intensity (gal/ft²) 6391 non-null float64 53 Release Date 11531 non-null datetime64[ns] 54 DEP Provided Water Use (kgal) 5247 non-null float64 55 Automatic Water Benchmarking Eligible 7265 non-null object 56 Reported Water Method 5247 non-null object dtypes: datetime64[ns](1), float64(26), int64(1), object(29) memory usage: 5.8+ MB
df.describe()
| Record Number | Order | NYC Borough, Block and Lot (BBL) | Street Number | Zip Code | Largest Property Use Type - Gross Floor Area (ft²) | Number of Buildings - Self-reported | Occupancy | ENERGY STAR Score | Site EUI (kBtu/ft²) | ... | Weather Normalized Site Electricity (kWh) | Total GHG Emissions (Metric Tons CO2e) | Direct GHG Emissions (Metric Tons CO2e) | Indirect GHG Emissions (Metric Tons CO2e) | DOF Property Floor Area (ft²) | Property GFA - Self-reported (ft²) | Water Use (All Water Sources) (kgal) | Municipally Supplied Potable Water - Indoor Intensity (gal/ft²) | Release Date | DEP Provided Water Use (kgal) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.153100e+04 | 13223.00000 | 1.322200e+04 | 10380.000000 | 12627.000000 | 1.152700e+04 | 11526.000000 | 11531.000000 | 9535.000000 | 11439.000000 | ... | 1.109700e+04 | 1.147800e+04 | 1.147100e+04 | 1.149500e+04 | 1.262700e+04 | 1.153100e+04 | 7.265000e+03 | 6391.000000 | 11531 | 5247.000000 |
| mean | 3.440970e+06 | 6612.00000 | 2.198121e+09 | 996.288247 | 10569.869011 | 1.641104e+05 | 1.387819 | 98.987512 | 57.735711 | 525.733377 | ... | 2.380525e+06 | 6.952577e+03 | 1.340512e+03 | 5.605184e+03 | 1.636773e+05 | 1.709439e+05 | 2.579751e+04 | 126.303422 | 2016-05-15 11:24:30.392767488 | 6478.597370 |
| min | 7.365000e+03 | 1.00000 | 1.000048e+09 | 0.000000 | 0.000000 | 5.800000e+01 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | ... | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | -2.335860e+04 | 5.000800e+04 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 2016-02-19 12:08:50 | 1.000000 |
| 25% | 2.734774e+06 | 3306.50000 | 1.013110e+09 | 146.000000 | 10024.000000 | 6.536950e+04 | 1.000000 | 100.000000 | 34.000000 | 65.300000 | ... | 3.026190e+05 | 3.420250e+02 | 1.715000e+02 | 8.630000e+01 | 6.500000e+04 | 6.717600e+04 | 2.661700e+03 | 28.240000 | 2016-04-21 10:03:55 | 2876.300000 |
| 50% | 3.111728e+06 | 6612.00000 | 2.029920e+09 | 441.000000 | 10456.000000 | 9.226600e+04 | 1.000000 | 100.000000 | 63.000000 | 82.400000 | ... | 5.454674e+05 | 5.198000e+02 | 3.051000e+02 | 1.580000e+02 | 9.240000e+04 | 9.491000e+04 | 4.745600e+03 | 45.700000 | 2016-04-29 09:31:42 | 4799.800000 |
| 75% | 4.371063e+06 | 9917.50000 | 3.061218e+09 | 1300.250000 | 11222.000000 | 1.557285e+05 | 1.000000 | 100.000000 | 83.000000 | 103.000000 | ... | 1.311119e+06 | 9.394500e+02 | 5.050000e+02 | 4.067000e+02 | 1.563555e+05 | 1.615295e+05 | 8.057900e+03 | 71.680000 | 2016-05-31 13:43:25 | 7732.800000 |
| max | 5.053345e+06 | 13223.00000 | 5.080080e+09 | 76179.000000 | 11694.000000 | 6.940450e+06 | 161.000000 | 100.000000 | 100.000000 | 801504.700000 | ... | 3.046183e+09 | 1.501468e+07 | 5.852949e+06 | 1.501455e+07 | 1.343508e+07 | 6.940450e+06 | 4.385740e+07 | 73011.090000 | 2016-08-01 23:04:11 | 129781.600000 |
| std | 8.500920e+05 | 3817.29564 | 1.218340e+09 | 1550.762447 | 645.826741 | 2.569770e+05 | 3.631729 | 6.941736 | 30.143817 | 10120.105154 | ... | 3.576546e+07 | 1.692231e+05 | 5.949074e+04 | 1.583246e+05 | 2.993519e+05 | 2.680820e+05 | 5.860239e+05 | 1483.749849 | NaN | 7090.965112 |
8 rows × 28 columns
df.describe(include='object')
| Co-reported BBL Status | BBLs Co-reported | Reported NYC Building Identification Numbers (BINs) | Property Name | Parent Property Id | Parent Property Name | Street Name | Borough | DOF Benchmarking Submission Status | Primary Property Type - Self Selected | ... | Fuel Oil #1 Use (kBtu) | Fuel Oil #2 Use (kBtu) | Fuel Oil #4 Use (kBtu) | Fuel Oil #5 & 6 Use (kBtu) | Diesel #2 Use (kBtu) | District Steam Use (kBtu) | District Hot Water Use (kBtu) | District Chilled Water Use (kBtu) | Automatic Water Benchmarking Eligible | Reported Water Method | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 615 | 615 | 11978 | 11531 | 11531 | 11531 | 13223 | 13223 | 13223 | 11978 | ... | 11978 | 11978 | 11978 | 11978 | 11978 | 11978 | 11978 | 11978 | 7265 | 5247 |
| unique | 3 | 169 | 11385 | 11522 | 107 | 107 | 2274 | 8 | 3 | 53 | ... | 8 | 2291 | 1319 | 464 | 17 | 927 | 2 | 12 | 2 | 2 |
| top | Secondary | 3017940054;3017940119;3017940121;3017940139;30... | See Primary BBL | Club Quarters | Not Applicable: Standalone Property | Not Applicable: Standalone Property | Not Available | Manhattan | In Compliance | Multifamily Housing | ... | Not Available | Not Available | Not Available | Not Available | Not Available | Not Available | Not Available | Not Available | Eligible | ABS |
| freq | 410 | 48 | 447 | 3 | 11330 | 11330 | 596 | 5694 | 11762 | 8699 | ... | 11525 | 8845 | 10108 | 10732 | 11510 | 10599 | 11531 | 11521 | 6317 | 5056 |
4 rows × 29 columns
df.isna().sum()
Record Number 1692 Order 0 NYC Borough, Block and Lot (BBL) 1 Co-reported BBL Status 12608 BBLs Co-reported 12608 Reported NYC Building Identification Numbers (BINs) 1245 Property Name 1692 Parent Property Id 1692 Parent Property Name 1692 Street Number 2843 Street Name 0 Zip Code 596 Borough 0 DOF Benchmarking Submission Status 0 Primary Property Type - Self Selected 1245 List of All Property Use Types at Property 1245 Largest Property Use Type 1245 Largest Property Use Type - Gross Floor Area (ft²) 1696 2nd Largest Property Use Type 1245 2nd Largest Property Use - Gross Floor Area (ft²) 1245 3rd Largest Property Use Type 1245 3rd Largest Property Use Type - Gross Floor Area (ft²) 1245 Year Built 1245 Number of Buildings - Self-reported 1697 Occupancy 1692 Metered Areas (Energy) 1245 Metered Areas (Water) 1245 ENERGY STAR Score 3688 Site EUI (kBtu/ft²) 1784 Weather Normalized Site EUI (kBtu/ft²) 3375 Weather Normalized Site Electricity Intensity (kWh/ft²) 2128 Weather Normalized Site Natural Gas Intensity (therms/ft²) 3760 Source EUI (kBtu/ft²) 1784 Weather Normalized Source EUI (kBtu/ft²) 3375 Fuel Oil #1 Use (kBtu) 1245 Fuel Oil #2 Use (kBtu) 1245 Fuel Oil #4 Use (kBtu) 1245 Fuel Oil #5 & 6 Use (kBtu) 1245 Diesel #2 Use (kBtu) 1245 District Steam Use (kBtu) 1245 District Hot Water Use (kBtu) 1245 District Chilled Water Use (kBtu) 1245 Natural Gas Use (kBtu) 3136 Weather Normalized Site Natural Gas Use (therms) 3758 Electricity Use - Grid Purchase (kBtu) 1798 Weather Normalized Site Electricity (kWh) 2126 Total GHG Emissions (Metric Tons CO2e) 1745 Direct GHG Emissions (Metric Tons CO2e) 1752 Indirect GHG Emissions (Metric Tons CO2e) 1728 DOF Property Floor Area (ft²) 596 Property GFA - Self-reported (ft²) 1692 Water Use (All Water Sources) (kgal) 5958 Municipally Supplied Potable Water - Indoor Intensity (gal/ft²) 6832 Release Date 1692 DEP Provided Water Use (kgal) 7976 Automatic Water Benchmarking Eligible 5958 Reported Water Method 7976 dtype: int64
df.duplicated().sum()
0
Removing unnecessary columns¶
df1 = df.drop(columns = ['Reported NYC Building Identification Numbers (BINs)', 'Record Number', 'Order', 'Street Number', 'Release Date', 'Co-reported BBL Status', 'BBLs Co-reported', 'Parent Property Id', 'Parent Property Name', 'List of All Property Use Types at Property', 'Reported Water Method', 'Number of Buildings - Self-reported', 'Occupancy'])
df1.sample(5)
| NYC Borough, Block and Lot (BBL) | Property Name | Street Name | Zip Code | Borough | DOF Benchmarking Submission Status | Primary Property Type - Self Selected | Largest Property Use Type | Largest Property Use Type - Gross Floor Area (ft²) | 2nd Largest Property Use Type | ... | Weather Normalized Site Electricity (kWh) | Total GHG Emissions (Metric Tons CO2e) | Direct GHG Emissions (Metric Tons CO2e) | Indirect GHG Emissions (Metric Tons CO2e) | DOF Property Floor Area (ft²) | Property GFA - Self-reported (ft²) | Water Use (All Water Sources) (kgal) | Municipally Supplied Potable Water - Indoor Intensity (gal/ft²) | DEP Provided Water Use (kgal) | Automatic Water Benchmarking Eligible | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8688 | 3.051220e+09 | Cascade: 229 East 18th Street | EAST 18 STREET | 11226.0 | Brooklyn | In Compliance | Multifamily Housing | Multifamily Housing | 67984.0 | Not Available | ... | 200096.0 | 320.4 | 263.8 | 56.6 | 58272.0 | 67984.0 | 5538.9 | 81.47 | 5538.9 | Eligible |
| 12446 | 2.028140e+09 | NaN | ANTHONY AVENUE | 10457.0 | Bronx | In Violation | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 56400.0 | NaN | NaN | NaN | NaN | NaN |
| 5794 | 2.027220e+09 | 887 Southern Blvd | SOUTHERN BOULEVARD | 10459.0 | Bronx | In Compliance | Multifamily Housing | Multifamily Housing | 102166.0 | Outpatient Rehabilitation/Physical Therapy | ... | 505340.4 | 437.4 | 294.4 | 143.0 | 102166.0 | 114517.0 | 6115.3 | NaN | NaN | Eligible |
| 201 | 1.001348e+09 | 25 Murray Street | MURRAY STREET | 10007.0 | Manhattan | In Compliance | Multifamily Housing | Multifamily Housing | 144330.0 | Not Available | ... | 917247.4 | 523.3 | 261.0 | 262.2 | 144330.0 | 144330.0 | NaN | NaN | NaN | NaN |
| 6651 | 2.033420e+09 | ROSENBERG: 3162 Bainbridge Ave | BAINBRIDGE AVENUE | 10467.0 | Bronx | In Compliance | Multifamily Housing | Multifamily Housing | 65320.0 | Retail Store | ... | 286224.0 | 417.9 | 336.9 | 81.0 | 71000.0 | 66515.0 | NaN | NaN | NaN | NaN |
5 rows × 44 columns
df1.shape
(13223, 44)
df1.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 13223 entries, 0 to 13222 Data columns (total 44 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 NYC Borough, Block and Lot (BBL) 13222 non-null float64 1 Property Name 11531 non-null object 2 Street Name 13223 non-null object 3 Zip Code 12627 non-null float64 4 Borough 13223 non-null object 5 DOF Benchmarking Submission Status 13223 non-null object 6 Primary Property Type - Self Selected 11978 non-null object 7 Largest Property Use Type 11978 non-null object 8 Largest Property Use Type - Gross Floor Area (ft²) 11527 non-null float64 9 2nd Largest Property Use Type 11978 non-null object 10 2nd Largest Property Use - Gross Floor Area (ft²) 11978 non-null object 11 3rd Largest Property Use Type 11978 non-null object 12 3rd Largest Property Use Type - Gross Floor Area (ft²) 11978 non-null object 13 Year Built 11978 non-null object 14 Metered Areas (Energy) 11978 non-null object 15 Metered Areas (Water) 11978 non-null object 16 ENERGY STAR Score 9535 non-null float64 17 Site EUI (kBtu/ft²) 11439 non-null float64 18 Weather Normalized Site EUI (kBtu/ft²) 9848 non-null float64 19 Weather Normalized Site Electricity Intensity (kWh/ft²) 11095 non-null float64 20 Weather Normalized Site Natural Gas Intensity (therms/ft²) 9463 non-null float64 21 Source EUI (kBtu/ft²) 11439 non-null float64 22 Weather Normalized Source EUI (kBtu/ft²) 9848 non-null float64 23 Fuel Oil #1 Use (kBtu) 11978 non-null object 24 Fuel Oil #2 Use (kBtu) 11978 non-null object 25 Fuel Oil #4 Use (kBtu) 11978 non-null object 26 Fuel Oil #5 & 6 Use (kBtu) 11978 non-null object 27 Diesel #2 Use (kBtu) 11978 non-null object 28 District Steam Use (kBtu) 11978 non-null object 29 District Hot Water Use (kBtu) 11978 non-null object 30 District Chilled Water Use (kBtu) 11978 non-null object 31 Natural Gas Use (kBtu) 10087 non-null float64 32 Weather Normalized Site Natural Gas Use (therms) 9465 non-null float64 33 Electricity Use - Grid Purchase (kBtu) 11425 non-null float64 34 Weather Normalized Site Electricity (kWh) 11097 non-null float64 35 Total GHG Emissions (Metric Tons CO2e) 11478 non-null float64 36 Direct GHG Emissions (Metric Tons CO2e) 11471 non-null float64 37 Indirect GHG Emissions (Metric Tons CO2e) 11495 non-null float64 38 DOF Property Floor Area (ft²) 12627 non-null float64 39 Property GFA - Self-reported (ft²) 11531 non-null float64 40 Water Use (All Water Sources) (kgal) 7265 non-null float64 41 Municipally Supplied Potable Water - Indoor Intensity (gal/ft²) 6391 non-null float64 42 DEP Provided Water Use (kgal) 5247 non-null float64 43 Automatic Water Benchmarking Eligible 7265 non-null object dtypes: float64(22), object(22) memory usage: 4.4+ MB
df1.isna().sum()
NYC Borough, Block and Lot (BBL) 1 Property Name 1692 Street Name 0 Zip Code 596 Borough 0 DOF Benchmarking Submission Status 0 Primary Property Type - Self Selected 1245 Largest Property Use Type 1245 Largest Property Use Type - Gross Floor Area (ft²) 1696 2nd Largest Property Use Type 1245 2nd Largest Property Use - Gross Floor Area (ft²) 1245 3rd Largest Property Use Type 1245 3rd Largest Property Use Type - Gross Floor Area (ft²) 1245 Year Built 1245 Metered Areas (Energy) 1245 Metered Areas (Water) 1245 ENERGY STAR Score 3688 Site EUI (kBtu/ft²) 1784 Weather Normalized Site EUI (kBtu/ft²) 3375 Weather Normalized Site Electricity Intensity (kWh/ft²) 2128 Weather Normalized Site Natural Gas Intensity (therms/ft²) 3760 Source EUI (kBtu/ft²) 1784 Weather Normalized Source EUI (kBtu/ft²) 3375 Fuel Oil #1 Use (kBtu) 1245 Fuel Oil #2 Use (kBtu) 1245 Fuel Oil #4 Use (kBtu) 1245 Fuel Oil #5 & 6 Use (kBtu) 1245 Diesel #2 Use (kBtu) 1245 District Steam Use (kBtu) 1245 District Hot Water Use (kBtu) 1245 District Chilled Water Use (kBtu) 1245 Natural Gas Use (kBtu) 3136 Weather Normalized Site Natural Gas Use (therms) 3758 Electricity Use - Grid Purchase (kBtu) 1798 Weather Normalized Site Electricity (kWh) 2126 Total GHG Emissions (Metric Tons CO2e) 1745 Direct GHG Emissions (Metric Tons CO2e) 1752 Indirect GHG Emissions (Metric Tons CO2e) 1728 DOF Property Floor Area (ft²) 596 Property GFA - Self-reported (ft²) 1692 Water Use (All Water Sources) (kgal) 5958 Municipally Supplied Potable Water - Indoor Intensity (gal/ft²) 6832 DEP Provided Water Use (kgal) 7976 Automatic Water Benchmarking Eligible 5958 dtype: int64
df1.nunique()
NYC Borough, Block and Lot (BBL) 13222 Property Name 11522 Street Name 2274 Zip Code 192 Borough 8 DOF Benchmarking Submission Status 3 Primary Property Type - Self Selected 53 Largest Property Use Type 52 Largest Property Use Type - Gross Floor Area (ft²) 9355 2nd Largest Property Use Type 56 2nd Largest Property Use - Gross Floor Area (ft²) 2231 3rd Largest Property Use Type 46 3rd Largest Property Use Type - Gross Floor Area (ft²) 977 Year Built 155 Metered Areas (Energy) 10 Metered Areas (Water) 7 ENERGY STAR Score 100 Site EUI (kBtu/ft²) 1985 Weather Normalized Site EUI (kBtu/ft²) 1881 Weather Normalized Site Electricity Intensity (kWh/ft²) 432 Weather Normalized Site Natural Gas Intensity (therms/ft²) 58 Source EUI (kBtu/ft²) 2892 Weather Normalized Source EUI (kBtu/ft²) 2750 Fuel Oil #1 Use (kBtu) 8 Fuel Oil #2 Use (kBtu) 2291 Fuel Oil #4 Use (kBtu) 1319 Fuel Oil #5 & 6 Use (kBtu) 464 Diesel #2 Use (kBtu) 17 District Steam Use (kBtu) 927 District Hot Water Use (kBtu) 2 District Chilled Water Use (kBtu) 12 Natural Gas Use (kBtu) 9951 Weather Normalized Site Natural Gas Use (therms) 9322 Electricity Use - Grid Purchase (kBtu) 11352 Weather Normalized Site Electricity (kWh) 11042 Total GHG Emissions (Metric Tons CO2e) 7882 Direct GHG Emissions (Metric Tons CO2e) 6261 Indirect GHG Emissions (Metric Tons CO2e) 5645 DOF Property Floor Area (ft²) 10103 Property GFA - Self-reported (ft²) 9333 Water Use (All Water Sources) (kgal) 6810 Municipally Supplied Potable Water - Indoor Intensity (gal/ft²) 4788 DEP Provided Water Use (kgal) 5009 Automatic Water Benchmarking Eligible 2 dtype: int64
Outlier Analysis & Treatment¶
numerical_cols = df1.select_dtypes(include=['float64', 'int64']).columns
numerical_cols
Index(['NYC Borough, Block and Lot (BBL)', 'Zip Code',
'Largest Property Use Type - Gross Floor Area (ft²)',
'ENERGY STAR Score', 'Site EUI (kBtu/ft²)',
'Weather Normalized Site EUI (kBtu/ft²)',
'Weather Normalized Site Electricity Intensity (kWh/ft²)',
'Weather Normalized Site Natural Gas Intensity (therms/ft²)',
'Source EUI (kBtu/ft²)', 'Weather Normalized Source EUI (kBtu/ft²)',
'Natural Gas Use (kBtu)',
'Weather Normalized Site Natural Gas Use (therms)',
'Electricity Use - Grid Purchase (kBtu)',
'Weather Normalized Site Electricity (kWh)',
'Total GHG Emissions (Metric Tons CO2e)',
'Direct GHG Emissions (Metric Tons CO2e)',
'Indirect GHG Emissions (Metric Tons CO2e)',
'DOF Property Floor Area (ft²)', 'Property GFA - Self-reported (ft²)',
'Water Use (All Water Sources) (kgal)',
'Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)',
'DEP Provided Water Use (kgal)'],
dtype='object')
for col in numerical_cols:
Q1 = df1[col].quantile(0.25)
Q3 = df1[col].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df1[col] = df1[col].apply(lambda x: upper_bound if x > upper_bound else (lower_bound if x < lower_bound else x))
df1.describe()
| NYC Borough, Block and Lot (BBL) | Zip Code | Largest Property Use Type - Gross Floor Area (ft²) | ENERGY STAR Score | Site EUI (kBtu/ft²) | Weather Normalized Site EUI (kBtu/ft²) | Weather Normalized Site Electricity Intensity (kWh/ft²) | Weather Normalized Site Natural Gas Intensity (therms/ft²) | Source EUI (kBtu/ft²) | Weather Normalized Source EUI (kBtu/ft²) | ... | Electricity Use - Grid Purchase (kBtu) | Weather Normalized Site Electricity (kWh) | Total GHG Emissions (Metric Tons CO2e) | Direct GHG Emissions (Metric Tons CO2e) | Indirect GHG Emissions (Metric Tons CO2e) | DOF Property Floor Area (ft²) | Property GFA - Self-reported (ft²) | Water Use (All Water Sources) (kgal) | Municipally Supplied Potable Water - Indoor Intensity (gal/ft²) | DEP Provided Water Use (kgal) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.322200e+04 | 12627.000000 | 11527.000000 | 9535.000000 | 11439.000000 | 9848.000000 | 11095.000000 | 9463.000000 | 11439.000000 | 9848.000000 | ... | 1.142500e+04 | 1.109700e+04 | 11478.000000 | 11471.000000 | 11495.000000 | 12627.000000 | 11531.000000 | 7265.000000 | 6391.000000 | 5247.000000 |
| mean | 2.198121e+09 | 10576.384414 | 123855.494326 | 57.735711 | 84.950699 | 84.794121 | 6.916192 | 0.449794 | 139.486963 | 139.603887 | ... | 3.258409e+06 | 9.459251e+05 | 722.204369 | 370.945423 | 292.008752 | 124528.037044 | 128253.021204 | 5942.993930 | 52.978140 | 5784.196598 |
| std | 1.218340e+09 | 576.984717 | 77981.200377 | 30.143817 | 33.794678 | 32.936387 | 4.390756 | 0.363962 | 57.700609 | 56.949520 | ... | 3.041626e+06 | 8.848482e+05 | 534.520737 | 291.261814 | 285.737844 | 78066.569027 | 81411.206767 | 4500.004504 | 34.959198 | 3904.043061 |
| min | 1.000048e+09 | 8227.000000 | 58.000000 | 1.000000 | 8.750000 | 10.762500 | 0.000000 | 0.000000 | 5.175000 | 7.612500 | ... | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000 | -394.300000 | 50008.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 1.013110e+09 | 10024.000000 | 65369.500000 | 34.000000 | 65.300000 | 65.700000 | 3.800000 | 0.100000 | 103.200000 | 103.800000 | ... | 1.045702e+06 | 3.026190e+05 | 342.025000 | 171.500000 | 86.300000 | 65000.000000 | 67176.000000 | 2661.700000 | 28.240000 | 2876.300000 |
| 50% | 2.029920e+09 | 10456.000000 | 92266.000000 | 63.000000 | 82.400000 | 82.500000 | 5.300000 | 0.500000 | 129.200000 | 129.200000 | ... | 1.885996e+06 | 5.454674e+05 | 519.800000 | 305.100000 | 158.000000 | 92400.000000 | 94910.000000 | 4745.600000 | 45.700000 | 4799.800000 |
| 75% | 3.061218e+09 | 11222.000000 | 155728.500000 | 83.000000 | 103.000000 | 102.325000 | 8.900000 | 0.700000 | 168.550000 | 167.925000 | ... | 4.513704e+06 | 1.311119e+06 | 939.450000 | 505.000000 | 406.700000 | 156355.500000 | 161529.500000 | 8057.900000 | 71.680000 | 7732.800000 |
| max | 5.080080e+09 | 11694.000000 | 291267.000000 | 100.000000 | 159.550000 | 157.262500 | 16.550000 | 1.600000 | 266.575000 | 264.112500 | ... | 9.715707e+06 | 2.823868e+06 | 1835.587500 | 1005.250000 | 887.300000 | 293388.750000 | 303059.750000 | 16152.200000 | 136.840000 | 15017.550000 |
8 rows × 22 columns
Encoding the required categorical columns¶
from sklearn.preprocessing import LabelEncoder
one_hot_columns = ['Borough','DOF Benchmarking Submission Status']
label_columns = ['Automatic Water Benchmarking Eligible']
df1_encoded = pd.get_dummies(df1, columns=one_hot_columns, drop_first=True)
label_encoder = LabelEncoder()
for col in label_columns:
df1_encoded[col] = label_encoder.fit_transform(df1_encoded[col])
df1_encoded.head()
| NYC Borough, Block and Lot (BBL) | Property Name | Street Name | Zip Code | Primary Property Type - Self Selected | Largest Property Use Type | Largest Property Use Type - Gross Floor Area (ft²) | 2nd Largest Property Use Type | 2nd Largest Property Use - Gross Floor Area (ft²) | 3rd Largest Property Use Type | ... | Automatic Water Benchmarking Eligible | Borough_Brooklyn | Borough_Brooklyn | Borough_Manhattan | Borough_Manhattan | Borough_Not Available | Borough_Queens | Borough_Staten Island | DOF Benchmarking Submission Status_In Violation | DOF Benchmarking Submission Status_Not on CBL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.003360e+09 | 16 Bialystoker Place | BIALYSTOKER PLACE | 10002.0 | Multifamily Housing | Multifamily Housing | 98000.0 | Not Available | Not Available | Not Available | ... | 2 | False | False | True | False | False | False | False | False | False |
| 1 | 1.010618e+09 | (7478) - 432 West 52nd Street Condominium | WEST 52 STREET | 10019.0 | Multifamily Housing | Multifamily Housing | 57401.0 | Not Available | Not Available | Not Available | ... | 1 | False | False | True | False | False | False | False | False | False |
| 2 | 1.014270e+09 | (9250) - 250 East 73rd Street | 2 AVENUE | 10021.0 | Multifamily Housing | Multifamily Housing | 166433.0 | Not Available | Not Available | Not Available | ... | 0 | False | False | True | False | False | False | False | False | False |
| 3 | 1.015180e+09 | (9141) - 141 East 89th Street | LEXINGTON AVENUE | 10128.0 | Multifamily Housing | Multifamily Housing | 114939.0 | Not Available | Not Available | Not Available | ... | 0 | False | False | True | False | False | False | False | False | False |
| 4 | 1.003920e+09 | 605 East 9th ST | EAST 10 STREET | 10009.0 | Residence Hall/Dormitory | Not Available | NaN | Not Available | Not Available | Not Available | ... | 2 | False | False | True | False | False | False | False | False | False |
5 rows × 51 columns
print(df1['Borough'].unique())
['Manhattan' 'Bronx' 'Not Available' 'Brooklyn' 'Brooklyn ' 'Queens' 'Staten Island' 'Manhattan ']
df1['Borough'] = df1['Borough'].str.strip().str.title()
filtered_df = df1[df1['Borough'] != 'Not Available']
plt.figure(figsize = (15, 10))
sns.boxplot(x = "Borough", y='ENERGY STAR Score', data = filtered_df)
plt.show()
Manhattan has the widest range and highest median emissions, indicating more properties with significant GHG output, likely due to its density.
filtered_df = filtered_df[(filtered_df['Largest Property Use Type'] != 'See Primary BBL') & (filtered_df['Largest Property Use Type'] != 'Other')]
top10_largest_property_types = filtered_df['Largest Property Use Type'].value_counts().head(10).index
filtered_df1 = filtered_df[filtered_df['Largest Property Use Type'].isin(top10_largest_property_types)]
plt.figure(figsize = (18, 8))
sns.barplot(x = "Largest Property Use Type", y='ENERGY STAR Score', data = filtered_df1, errorbar=None)
plt.title("Largest Property Use Type vs ENERGY STAR Score", fontsize=16)
plt.xticks(rotation=45)
plt.show()
"Office" generally exhibit better energy performance compared to other property types.
filtered_df1['Largest Property Use Type'].value_counts()
Largest Property Use Type Multifamily Housing 8569 Office 1264 Hotel 237 Non-Refrigerated Warehouse 194 Senior Care Community 111 Residence Hall/Dormitory 99 K-12 School 93 Retail Store 90 Self-Storage Facility 87 College/University 63 Name: count, dtype: int64
df1_cor = df1[['Largest Property Use Type - Gross Floor Area (ft²)', 'ENERGY STAR Score', 'Site EUI (kBtu/ft²)', 'Weather Normalized Site EUI (kBtu/ft²)', 'Weather Normalized Site Electricity Intensity (kWh/ft²)', 'Weather Normalized Site Natural Gas Intensity (therms/ft²)', 'Source EUI (kBtu/ft²)', 'Weather Normalized Source EUI (kBtu/ft²)', 'Natural Gas Use (kBtu)', 'Weather Normalized Site Natural Gas Use (therms)', 'Electricity Use - Grid Purchase (kBtu)', 'Weather Normalized Site Electricity (kWh)', 'Total GHG Emissions (Metric Tons CO2e)',
'Direct GHG Emissions (Metric Tons CO2e)', 'Indirect GHG Emissions (Metric Tons CO2e)', 'DOF Property Floor Area (ft²)', 'Property GFA - Self-reported (ft²)', 'Water Use (All Water Sources) (kgal)', 'Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)', 'DEP Provided Water Use (kgal)']]
correlation_matrix = df1_cor.corr()
plt.figure(figsize=(15, 10))
sns.heatmap(correlation_matrix, annot=True, cmap="viridis", fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap")
plt.show()
Gross Floor Area and Total GHG Emissions have a strong positive correlation. This suggests that larger buildings tend to produce more greenhouse gas emissions, which makes sense considering their higher energy consumption and operational needs.
import warnings
warnings.filterwarnings("ignore")
borough_counts = filtered_df['Borough'].value_counts().reset_index()
borough_counts.columns = ['Borough', 'Count']
plt.figure(figsize=(15, 10))
sns.barplot(x='Borough', y='Count', data=borough_counts)
plt.title("Count of Boroughs")
plt.xlabel("Borough")
plt.ylabel("Count")
plt.show()
Energy Use Intensity¶
Borough vs Energy Use Intensity¶
eui_data = filtered_df1.groupby('Borough')[['Site EUI (kBtu/ft²)', 'Weather Normalized Site EUI (kBtu/ft²)']].mean().reset_index()
eui_data_melted = eui_data.melt(id_vars='Borough', var_name='Metric', value_name='EUI')
plt.figure(figsize=(15, 12))
sns.barplot(x='Borough', y='EUI', hue='Metric', data=eui_data_melted, palette='viridis')
plt.title('Borough vs Mean Site EUI and Weather Normalized Site EUI', fontsize=16)
plt.xlabel('Borough', fontsize=14)
plt.ylabel('Mean EUI (kBtu/ft²)', fontsize=14)
plt.legend(fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
The Bronx has the highest energy utilization, while Brooklyn exhibits the lowest among all boroughs.
Largest Property Use Type vs Energy Use Intensity¶
eui_data = filtered_df1.groupby('Largest Property Use Type')[['Site EUI (kBtu/ft²)', 'Weather Normalized Site EUI (kBtu/ft²)']].mean().reset_index()
eui_data_melted = eui_data.melt(id_vars='Largest Property Use Type', var_name='Metric', value_name='EUI')
plt.figure(figsize=(15, 12))
sns.barplot(x='Largest Property Use Type', y='EUI', hue='Metric', data=eui_data_melted, palette='viridis')
plt.title('Largest Property Use Type vs Mean Site EUI and Weather Normalized Site EUI', fontsize=16)
plt.xlabel('Largest Property Use Type', fontsize=14)
plt.ylabel('Mean EUI (kBtu/ft²)', fontsize=14)
plt.legend(fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Hospitals and senior care facilities have the highest energy intensity, while storage/utility properties exhibit the lowest energy consumption among all property types.
Year Built vs Energy Use Intensity¶
filtered_df1['Year Built'] = pd.to_numeric(filtered_df1['Year Built'], errors='coerce')
filtered_df1 = filtered_df1.dropna(subset=['Year Built'])
filtered_df1['Built Year Group'] = pd.cut(
filtered_df1['Year Built'],
bins=[1800, 1900, 1950, 2000, 2020],
labels=['1800-1900', '1900-1950', '1950-2000', '2000-2020']
)
eui_data = filtered_df1.groupby('Built Year Group')[['Site EUI (kBtu/ft²)', 'Weather Normalized Site EUI (kBtu/ft²)']].mean().reset_index()
eui_data_melted = eui_data.melt(id_vars='Built Year Group', var_name='Metric', value_name='EUI')
plt.figure(figsize=(15, 12))
sns.lineplot(x='Built Year Group', y='EUI', hue='Metric', data=eui_data_melted, palette='viridis')
plt.title('Year Built vs Mean Site EUI and Weather Normalized Site EUI', fontsize=16)
plt.xlabel('Built Year Group', fontsize=14)
plt.ylabel('Mean EUI (kBtu/ft²)', fontsize=14)
plt.legend(fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Buildings constructed between 1980-2000 have the highest mean Site EUI and Weather Normalized Site EUI among all built year groups.
Weather Normalized Site Electricity Intensity (kWh/ft²)¶
Borough vs Weather Normalized Site Electricity Intensity (kWh/ft²)¶
plt.figure(figsize=(15, 10))
sns.barplot(x='Borough', y='Weather Normalized Site Electricity Intensity (kWh/ft²)', data=filtered_df1, errorbar=None, palette='viridis')
plt.title("Borough vs Weather Normalized Site Electricity Intensity (kWh/ft²)")
plt.xlabel("Borough")
plt.ylabel("Weather Normalized Site Electricity Intensity (kWh/ft²)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Manhattan exhibits highest Weather normalised site electrical intensity among others.
Largest Property Use Type vs Weather Normalized Site Electricity Intensity (kWh/ft²)¶
plt.figure(figsize=(15, 10))
sns.barplot(x='Largest Property Use Type', y='Weather Normalized Site Electricity Intensity (kWh/ft²)', data=filtered_df1, errorbar=None, palette='viridis')
plt.title("Largest Property Use Type vs Weather Normalized Site Electricity Intensity (kWh/ft²)")
plt.xlabel("Largest Property Use Type")
plt.ylabel("Weather Normalized Site Electricity Intensity (kWh/ft²)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Hotel exhibits highest Weather normalised site electrical intensity among others. Likely due to its business nature.
Year Built vs Normalized Site Electricity Intensity (kWh/ft²)¶
plt.figure(figsize=(15, 10))
sns.lineplot(x='Built Year Group', y='Weather Normalized Site Electricity Intensity (kWh/ft²)', data=filtered_df1, errorbar=None)
plt.title("Built Year Group vs Weather Normalized Site Electricity Intensity (kWh/ft²)")
plt.xlabel("Built Year Group")
plt.ylabel("Weather Normalized Site Electricity Intensity (kWh/ft²)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
The building constructed between 1900-1950, exhibits the lowest weather normalized electricity intensity.
Weather Normalized Site Natural Gas Intensity (therms/ft²)¶
Borough vs Weather Normalized Site Natural Gas Intensity (therms/ft²)¶
plt.figure(figsize=(15, 10))
sns.barplot(x='Borough', y='Weather Normalized Site Natural Gas Intensity (therms/ft²)', data=filtered_df1, errorbar=None, palette='viridis')
plt.title("Borough vs Weather Normalized Site Natural Gas Intensity (therms/ft²)")
plt.xlabel("Borough")
plt.ylabel("Weather Normalized Site Natural Gas Intensity (therms/ft²)")
plt.tight_layout()
plt.xticks(rotation=45)
plt.show()
Brooklyn has the highest weather normalized site natural gas intensity among the boroughs.
Largest Use Property Type vs Weather Normalized Site Natural Gas Intensity (therms/ft²)¶
plt.figure(figsize=(15, 10))
sns.barplot(x='Largest Property Use Type', y='Weather Normalized Site Natural Gas Intensity (therms/ft²)', data=filtered_df1, errorbar=None, palette='viridis')
plt.title("Largest Property Use Type vs Weather Normalized Site Natural Gas Intensity (therms/ft²)")
plt.xlabel('Largest Property Use Type')
plt.ylabel("Weather Normalized Site Natural Gas Intensity (therms/ft²)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Senior Care Communities have the highest Weather Normalized Site Natural Gas Intensity at approximately 0.8 therms/ft². This could be due to their high energy needs for heating, cooking, and other services required for residents.
Built Year vs Weather Normalized Site Natural Gas Intensity (therms/ft²)¶
plt.figure(figsize=(15, 10))
sns.lineplot(x='Built Year Group', y='Weather Normalized Site Natural Gas Intensity (therms/ft²)', data=filtered_df1, errorbar=None)
plt.title("Borough vs Weather Normalized Site Natural Gas Intensity (therms/ft²)")
plt.xlabel("Built Year")
plt.ylabel("Weather Normalized Site Natural Gas Intensity (therms/ft²)")
plt.tight_layout()
plt.show()
The natural gas intensity incerasing gradully over the period till 2000, after that decreasing likely due to the evalution of electrical usages increasing.
Natural Gas Use (kBtu) & Electricity Use - Grid Purchase (kBtu)¶
Borough vs Natural Gas Use (kBtu) & Electricity Use - Grid Purchase (kBtu)¶
energy_data = filtered_df1.groupby('Borough')[['Natural Gas Use (kBtu)', 'Electricity Use - Grid Purchase (kBtu)']].mean().reset_index()
energy_data_melted = energy_data.melt(id_vars='Borough', var_name='Energy Type', value_name='Usage (kBtu)')
plt.figure(figsize=(16, 14))
sns.barplot(x='Borough', y='Usage (kBtu)', hue='Energy Type', data=energy_data_melted, palette='viridis')
plt.title('Energy Usage by Borough and Type (kBtu)', fontsize=16)
plt.xlabel('Borough', fontsize=14)
plt.ylabel('Usage (kBtu)', fontsize=14)
plt.xticks(rotation=45)
plt.legend(title='Energy Type', fontsize=12, loc='upper left')
plt.tight_layout()
plt.show()
Queens has the highest natural gas usage among the boroughs, likely due to its larger residential area and population density.
Largest Property Use Type vs Natural Gas Use (kBtu) & Electricity Use - Grid Purchase (kBtu)¶
energy_data = filtered_df1.groupby('Largest Property Use Type')[['Natural Gas Use (kBtu)', 'Electricity Use - Grid Purchase (kBtu)']].mean().reset_index()
energy_data_melted = energy_data.melt(id_vars='Largest Property Use Type', var_name='Energy Type', value_name='Usage (kBtu)')
plt.figure(figsize=(16, 14))
sns.barplot(x='Largest Property Use Type', y='Usage (kBtu)', hue='Energy Type', data=energy_data_melted, palette='viridis')
plt.title('Energy Usage by Largest Property Use Type (kBtu)', fontsize=16)
plt.xlabel('Largest Property Use Type', fontsize=14)
plt.ylabel('Usage (kBtu)', fontsize=14)
plt.xticks(rotation=45)
plt.legend(title='Energy Type', fontsize=12, loc='upper left')
plt.tight_layout()
plt.show()
Senior Care Communities have the highest energy usage in terms of electricity among all property types. This is likely due to the extensive energy requirements for providing consistent heating, cooling, and various services necessary for residents.
Built Year vs Natural Gas Use (kBtu) & Electricity Use - Grid Purchase (kBtu)¶
energy_data = filtered_df1.groupby('Built Year Group')[['Natural Gas Use (kBtu)', 'Electricity Use - Grid Purchase (kBtu)']].mean().reset_index()
energy_data_melted = energy_data.melt(id_vars='Built Year Group', var_name='Energy Type', value_name='Usage (kBtu)')
plt.figure(figsize=(16, 14))
sns.lineplot(x='Built Year Group', y='Usage (kBtu)', hue='Energy Type', data=energy_data_melted, palette='viridis')
plt.title('Energy Usage by Built Year Group (kBtu)', fontsize=16)
plt.xlabel('Built Year Group', fontsize=14)
plt.ylabel('Usage (kBtu)', fontsize=14)
plt.legend(title='Energy Type', fontsize=12, loc='upper left')
plt.tight_layout()
plt.show()
The line graph shows that natural gas usage peaked in buildings constructed between 1950-2000, likely due to the boom in construction during that era and the prevalent use of natural gas for heating and other energy needs.
Weather Normalized Site Electricity (kWh)¶
Borough vs Weather Normalized Site Electricity (kWh)¶
plt.figure(figsize=(15, 10))
sns.barplot(x='Borough', y='Weather Normalized Site Electricity (kWh)', data=filtered_df1, errorbar=None, palette='viridis')
plt.title("Borough vs Weather Normalized Site Electricity (kWh)")
plt.xlabel("Borough")
plt.ylabel("Weather Normalized Site Electricity (kWh)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
The bar chart reveals that Manhattan has the highest electricity usage (weather normalized site electricity) compared to the other boroughs, likely due to its dense population and high concentration of commercial buildings.
Largest Property Use Type vs Weather Normalized Site Electricity (kWh)¶
plt.figure(figsize=(15, 10))
sns.barplot(x='Largest Property Use Type', y='Weather Normalized Site Electricity (kWh)', data=filtered_df1, errorbar=None, palette='viridis')
plt.title("Largest Property Use Type vs Weather Normalized Site Electricity (kWh)")
plt.xlabel("Largest Property Use Type")
plt.ylabel("Weather Normalized Site Electricity (kWh)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Office buildings have the highest weather normalized site electricity usage among property types. This is likely due to their large occupancy, extensive lighting, and HVAC needs.
Built Year Group vs Weather Normalized Site Electricity (kWh)¶
plt.figure(figsize=(15, 10))
sns.lineplot(x='Built Year Group', y='Weather Normalized Site Electricity (kWh)', data=filtered_df1, errorbar=None)
plt.title("Built Year Group vs Weather Normalized Site Electricity (kWh)")
plt.xlabel("Built Year Group")
plt.ylabel("Weather Normalized Site Electricity (kWh)")
plt.tight_layout()
plt.show()
Total GHG Emissions & Direct GHG Emissions & Indirect GHG Emissions (Metric Tons CO2e)¶
Borough vs Total GHG Emissions & Direct GHG Emissions & Indirect GHG Emissions (Metric Tons CO2e)¶
emission_data = filtered_df.groupby('Borough')[
['Total GHG Emissions (Metric Tons CO2e)',
'Direct GHG Emissions (Metric Tons CO2e)',
'Indirect GHG Emissions (Metric Tons CO2e)']
].mean().reset_index()
emission_data_melted = emission_data.melt(
id_vars='Borough',
var_name='GHG Emissions',
value_name='Metric Tons CO2e'
)
plt.figure(figsize=(16, 14))
sns.barplot(
x='Borough',
y='Metric Tons CO2e',
hue='GHG Emissions',
data=emission_data_melted,
palette='viridis'
)
plt.title('GHG Emissions by Borough (Metric Tons CO2e)', fontsize=16)
plt.xlabel('Borough', fontsize=14)
plt.ylabel('Metric Tons CO2e', fontsize=14)
plt.xticks(rotation=45)
plt.legend(title='Emission Type', fontsize=12, loc='upper left')
plt.tight_layout()
plt.show()
The bar chart reveals that Manhattan has the highest total GHG emissions among the boroughs, likely due to its dense population and high concentration of commercial buildings.
Largest Property Use Type Vs Total GHG Emissions & Direct GHG Emissions & Indirect GHG Emissions (Metric Tons CO2e)¶
emission_data = filtered_df1.groupby('Largest Property Use Type')[
['Total GHG Emissions (Metric Tons CO2e)',
'Direct GHG Emissions (Metric Tons CO2e)',
'Indirect GHG Emissions (Metric Tons CO2e)']
].mean().reset_index()
emission_data_melted = emission_data.melt(
id_vars='Largest Property Use Type',
var_name='GHG Emissions',
value_name='Metric Tons CO2e'
)
plt.figure(figsize=(20, 16))
sns.barplot(
x='Largest Property Use Type',
y='Metric Tons CO2e',
hue='GHG Emissions',
data=emission_data_melted,
palette='viridis'
)
plt.title('GHG Emissions by Largest Property Use Type (Metric Tons CO2e)', fontsize=16)
plt.xlabel('Largest Property Use Type', fontsize=14)
plt.ylabel('Metric Tons CO2e', fontsize=14)
plt.xticks(rotation=45)
plt.legend(title='Emission Type', fontsize=12, loc='upper left')
plt.tight_layout()
plt.show()
The chart shows that hotels have the highest total GHG emissions among all property types, primarily due to their high energy consumption from HVAC systems and lighting.
Built Year Group Vs Total GHG Emissions & Direct GHG Emissions & Indirect GHG Emissions (Metric Tons CO2e)¶
emission_data = filtered_df1.groupby('Built Year Group')[
['Total GHG Emissions (Metric Tons CO2e)',
'Direct GHG Emissions (Metric Tons CO2e)',
'Indirect GHG Emissions (Metric Tons CO2e)']
].mean().reset_index()
emission_data_melted = emission_data.melt(
id_vars='Built Year Group',
var_name='GHG Emissions',
value_name='Metric Tons CO2e'
)
plt.figure(figsize=(20, 16))
sns.lineplot(
x='Built Year Group',
y='Metric Tons CO2e',
hue='GHG Emissions',
data=emission_data_melted,
palette='viridis'
)
plt.title('GHG Emissions by Largest Property Use Type (Metric Tons CO2e)', fontsize=16)
plt.xlabel('Built Year Group', fontsize=14)
plt.ylabel('Metric Tons CO2e', fontsize=14)
plt.legend(title='Emission Type', fontsize=12, loc='upper left')
plt.tight_layout()
plt.show()
Total GHG Emissions peaked in buildings constructed between 1950-2000, which could be attributed to the construction boom and the prevalent use of energy-intensive materials and systems during that period.
Water Use (All Water Sources) (kgal)¶
Borough vs Water Use (All Water Sources) (kgal)¶
plt.figure(figsize=(16, 14))
sns.barplot(
x='Borough',
y='Water Use (All Water Sources) (kgal)',
data=filtered_df,
palette='viridis',
errorbar=None
)
plt.title('Borough vs Water Use (All Water Sources) (kgal)', fontsize=16)
plt.xlabel('Borough', fontsize=14)
plt.ylabel('Water Use (All Water Sources) (kgal)', fontsize=14)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Largest Property Use Type Vs Water Use (All Water Sources) (kgal)¶
plt.figure(figsize=(16, 14))
sns.barplot(
x='Largest Property Use Type',
y='Water Use (All Water Sources) (kgal)',
data=filtered_df1,
palette='viridis',
errorbar=None
)
plt.title('Largest Property Use Type vs Water Use (All Water Sources) (kgal)', fontsize=16)
plt.xlabel('Largest Property Use Type', fontsize=14)
plt.ylabel('Water Use (All Water Sources) (kgal)', fontsize=14)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
The bar chart reveals that hotels have the highest water use among all property types, likely due to their extensive guest services, including laundry, pools, and large kitchens.
Built Year Group Vs Water Use (All Water Sources) (kgal)¶
plt.figure(figsize=(16, 14))
sns.lineplot(
x='Built Year Group',
y='Water Use (All Water Sources) (kgal)',
data=filtered_df1,
palette='viridis',
errorbar=None
)
plt.title('Built Year Group vs Water Use (All Water Sources) (kgal)', fontsize=16)
plt.xlabel('Built Year Group', fontsize=14)
plt.ylabel('Water Use (All Water Sources) (kgal)', fontsize=14)
plt.tight_layout()
plt.show()
Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)¶
Borough vs Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)¶
plt.figure(figsize=(16, 14))
sns.barplot(
x='Borough',
y='Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)',
data=filtered_df,
palette='viridis',
errorbar=None
)
plt.title('Borough vs Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)', fontsize=16)
plt.xlabel('Borough', fontsize=14)
plt.ylabel('Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)', fontsize=14)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Largest Property Use Type vs Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)¶
plt.figure(figsize=(16, 14))
sns.barplot(
x='Largest Property Use Type',
y='Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)',
data=filtered_df1,
palette='viridis',
errorbar=None
)
plt.title('Largest Property Use Type vs Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)', fontsize=16)
plt.xlabel('Largest Property Use Type', fontsize=14)
plt.ylabel('Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)', fontsize=14)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Senior Care Communities have the highest indoor water intensity. This is likely due to the extensive water usage required for services such as bathing, laundry, and general care for residents.
Built Year Group vs Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)¶
plt.figure(figsize=(16, 14))
sns.lineplot(
x='Built Year Group',
y='Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)',
data=filtered_df1,
errorbar=None
)
plt.title('Built Year Group vs Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)', fontsize=16)
plt.xlabel('Built Year Group', fontsize=14)
plt.ylabel('Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)', fontsize=14)
plt.tight_layout()
plt.show()
Exploratory Data Analysis¶
Checking Normality of Data¶
import statsmodels.api as sm
norm_cols = ['Site EUI (kBtu/ft²)', 'Weather Normalized Site EUI (kBtu/ft²)', 'Weather Normalized Site Electricity Intensity (kWh/ft²)', 'Weather Normalized Site Natural Gas Intensity (therms/ft²)', 'Source EUI (kBtu/ft²)', 'Weather Normalized Source EUI (kBtu/ft²)', 'Natural Gas Use (kBtu)', 'Weather Normalized Site Natural Gas Use (therms)', 'Electricity Use - Grid Purchase (kBtu)', 'Weather Normalized Site Electricity (kWh)', 'Total GHG Emissions (Metric Tons CO2e)', 'Direct GHG Emissions (Metric Tons CO2e)', 'Indirect GHG Emissions (Metric Tons CO2e)', 'DOF Property Floor Area (ft²)', 'Property GFA - Self-reported (ft²)', 'Water Use (All Water Sources) (kgal)']
num_cols = len(norm_cols)
cols_per_row = 3
rows = (num_cols // cols_per_row) + (num_cols % cols_per_row > 0)
fig, axes = plt.subplots(rows, cols_per_row, figsize=(15, rows * 5))
axes = axes.flatten()
for i, col in enumerate(norm_cols):
if col in filtered_df.columns:
col_data = pd.to_numeric(filtered_df[col], errors='coerce').dropna()
if not col_data.empty:
sm.qqplot(col_data, line="s", ax=axes[i])
axes[i].set_title(f"QQ Plot: {col}")
else:
axes[i].set_title(f"No Data: {col}")
else:
axes[i].set_title(f"Column Not Found: {col}")
for j in range(i + 1, len(axes)):
fig.delaxes(axes[j])
plt.tight_layout()
plt.show()
ENERGY STAR Scores¶
from scipy.stats import kruskal
def perform_kruskal_test(df, group_col, target_col):
grouped_data = df.groupby(group_col)[target_col].apply(list)
if len(grouped_data) > 1:
stat, p_value = kruskal(*grouped_data)
print(f"Kruskal-Wallis Test for {target_col} by {group_col}")
print(f"Statistic: {stat:.4f}, p-value: {p_value:.4f}")
if p_value < 0.05:
print(f"Result: Significant difference exists between groups of {group_col}.\n")
else:
print(f"Result: No significant difference between groups of {group_col}.\n")
else:
print(f"Not enough valid groups for {group_col}.\n")
group_columns = ['Largest Property Use Type', 'Borough', 'Year Built']
target_column = 'ENERGY STAR Score'
filtered_df1[target_column] = pd.to_numeric(filtered_df1[target_column], errors='coerce')
filtered_df1 = filtered_df1.dropna(subset=[target_column])
for group_col in group_columns:
perform_kruskal_test(filtered_df1, group_col, target_column)
Kruskal-Wallis Test for ENERGY STAR Score by Largest Property Use Type Statistic: 304.3264, p-value: 0.0000 Result: Significant difference exists between groups of Largest Property Use Type. Kruskal-Wallis Test for ENERGY STAR Score by Borough Statistic: 99.6495, p-value: 0.0000 Result: Significant difference exists between groups of Borough. Kruskal-Wallis Test for ENERGY STAR Score by Year Built Statistic: 545.4667, p-value: 0.0000 Result: Significant difference exists between groups of Year Built.
GHG emissions¶
from scipy.stats import pearsonr
emission_columns = ['Total GHG Emissions (Metric Tons CO2e)',
'Direct GHG Emissions (Metric Tons CO2e)',
'Indirect GHG Emissions (Metric Tons CO2e)']
comp_columns = ["Year Built",
"Weather Normalized Site Electricity Intensity (kWh/ft²)",
"Weather Normalized Site EUI (kBtu/ft²)",
"Weather Normalized Site Natural Gas Intensity (therms/ft²)",
"Natural Gas Use (kBtu)",
"Weather Normalized Site Electricity (kWh)",
"DOF Property Floor Area (ft²)",
"Water Use (All Water Sources) (kgal)"]
for col in emission_columns:
for cl in comp_columns:
valid_data = filtered_df1[[cl, col]].dropna()
pearson_corr, pearson_p = pearsonr(valid_data[cl], valid_data[col])
print(f"Correlation Test for {cl} vs {col}:")
print(f"Pearson: Correlation={pearson_corr}, p-value={pearson_p}")
if pearson_p < 0.05:
print(f"The relationship between {cl} and the {col} is statistically significant.\n")
else:
print(f"The relationship between {cl} Built and the {col} is statistically not-significant.\n")
Correlation Test for Year Built vs Total GHG Emissions (Metric Tons CO2e): Pearson: Correlation=0.15259013034354285, p-value=4.838082561006758e-49 The relationship between Year Built and the Total GHG Emissions (Metric Tons CO2e) is statistically significant. Correlation Test for Weather Normalized Site Electricity Intensity (kWh/ft²) vs Total GHG Emissions (Metric Tons CO2e): Pearson: Correlation=0.4856422948366791, p-value=0.0 The relationship between Weather Normalized Site Electricity Intensity (kWh/ft²) and the Total GHG Emissions (Metric Tons CO2e) is statistically significant. Correlation Test for Weather Normalized Site EUI (kBtu/ft²) vs Total GHG Emissions (Metric Tons CO2e): Pearson: Correlation=0.4102876267725338, p-value=1.62e-321 The relationship between Weather Normalized Site EUI (kBtu/ft²) and the Total GHG Emissions (Metric Tons CO2e) is statistically significant. Correlation Test for Weather Normalized Site Natural Gas Intensity (therms/ft²) vs Total GHG Emissions (Metric Tons CO2e): Pearson: Correlation=0.029316198047107174, p-value=0.010068827538074178 The relationship between Weather Normalized Site Natural Gas Intensity (therms/ft²) and the Total GHG Emissions (Metric Tons CO2e) is statistically significant. Correlation Test for Natural Gas Use (kBtu) vs Total GHG Emissions (Metric Tons CO2e): Pearson: Correlation=0.52093512413176, p-value=0.0 The relationship between Natural Gas Use (kBtu) and the Total GHG Emissions (Metric Tons CO2e) is statistically significant. Correlation Test for Weather Normalized Site Electricity (kWh) vs Total GHG Emissions (Metric Tons CO2e): Pearson: Correlation=0.8537397902299549, p-value=0.0 The relationship between Weather Normalized Site Electricity (kWh) and the Total GHG Emissions (Metric Tons CO2e) is statistically significant. Correlation Test for DOF Property Floor Area (ft²) vs Total GHG Emissions (Metric Tons CO2e): Pearson: Correlation=0.8430661787822864, p-value=0.0 The relationship between DOF Property Floor Area (ft²) and the Total GHG Emissions (Metric Tons CO2e) is statistically significant. Correlation Test for Water Use (All Water Sources) (kgal) vs Total GHG Emissions (Metric Tons CO2e): Pearson: Correlation=0.5878536737678786, p-value=0.0 The relationship between Water Use (All Water Sources) (kgal) and the Total GHG Emissions (Metric Tons CO2e) is statistically significant. Correlation Test for Year Built vs Direct GHG Emissions (Metric Tons CO2e): Pearson: Correlation=0.03726436492965085, p-value=0.00035280117110480163 The relationship between Year Built and the Direct GHG Emissions (Metric Tons CO2e) is statistically significant. Correlation Test for Weather Normalized Site Electricity Intensity (kWh/ft²) vs Direct GHG Emissions (Metric Tons CO2e): Pearson: Correlation=-0.0885675251026727, p-value=4.9009147425238425e-17 The relationship between Weather Normalized Site Electricity Intensity (kWh/ft²) and the Direct GHG Emissions (Metric Tons CO2e) is statistically significant. Correlation Test for Weather Normalized Site EUI (kBtu/ft²) vs Direct GHG Emissions (Metric Tons CO2e): Pearson: Correlation=0.4280504166183924, p-value=0.0 The relationship between Weather Normalized Site EUI (kBtu/ft²) and the Direct GHG Emissions (Metric Tons CO2e) is statistically significant. Correlation Test for Weather Normalized Site Natural Gas Intensity (therms/ft²) vs Direct GHG Emissions (Metric Tons CO2e): Pearson: Correlation=0.2750033377742449, p-value=9.540230181821916e-134 The relationship between Weather Normalized Site Natural Gas Intensity (therms/ft²) and the Direct GHG Emissions (Metric Tons CO2e) is statistically significant. Correlation Test for Natural Gas Use (kBtu) vs Direct GHG Emissions (Metric Tons CO2e): Pearson: Correlation=0.6670485914026987, p-value=0.0 The relationship between Natural Gas Use (kBtu) and the Direct GHG Emissions (Metric Tons CO2e) is statistically significant. Correlation Test for Weather Normalized Site Electricity (kWh) vs Direct GHG Emissions (Metric Tons CO2e): Pearson: Correlation=0.2363697257738262, p-value=9.553896549835233e-114 The relationship between Weather Normalized Site Electricity (kWh) and the Direct GHG Emissions (Metric Tons CO2e) is statistically significant. Correlation Test for DOF Property Floor Area (ft²) vs Direct GHG Emissions (Metric Tons CO2e): Pearson: Correlation=0.41156033341204634, p-value=0.0 The relationship between DOF Property Floor Area (ft²) and the Direct GHG Emissions (Metric Tons CO2e) is statistically significant. Correlation Test for Water Use (All Water Sources) (kgal) vs Direct GHG Emissions (Metric Tons CO2e): Pearson: Correlation=0.4849217766361128, p-value=0.0 The relationship between Water Use (All Water Sources) (kgal) and the Direct GHG Emissions (Metric Tons CO2e) is statistically significant. Correlation Test for Year Built vs Indirect GHG Emissions (Metric Tons CO2e): Pearson: Correlation=0.18980566031154897, p-value=2.314870990698141e-75 The relationship between Year Built and the Indirect GHG Emissions (Metric Tons CO2e) is statistically significant. Correlation Test for Weather Normalized Site Electricity Intensity (kWh/ft²) vs Indirect GHG Emissions (Metric Tons CO2e): Pearson: Correlation=0.7147603927198645, p-value=0.0 The relationship between Weather Normalized Site Electricity Intensity (kWh/ft²) and the Indirect GHG Emissions (Metric Tons CO2e) is statistically significant. Correlation Test for Weather Normalized Site EUI (kBtu/ft²) vs Indirect GHG Emissions (Metric Tons CO2e): Pearson: Correlation=0.21362846978777594, p-value=5.345671579890487e-83 The relationship between Weather Normalized Site EUI (kBtu/ft²) and the Indirect GHG Emissions (Metric Tons CO2e) is statistically significant. Correlation Test for Weather Normalized Site Natural Gas Intensity (therms/ft²) vs Indirect GHG Emissions (Metric Tons CO2e): Pearson: Correlation=-0.12746269213152003, p-value=2.8249315800464165e-29 The relationship between Weather Normalized Site Natural Gas Intensity (therms/ft²) and the Indirect GHG Emissions (Metric Tons CO2e) is statistically significant. Correlation Test for Natural Gas Use (kBtu) vs Indirect GHG Emissions (Metric Tons CO2e): Pearson: Correlation=0.3099632385025462, p-value=5.415716841659371e-181 The relationship between Natural Gas Use (kBtu) and the Indirect GHG Emissions (Metric Tons CO2e) is statistically significant. Correlation Test for Weather Normalized Site Electricity (kWh) vs Indirect GHG Emissions (Metric Tons CO2e): Pearson: Correlation=0.9635466262650673, p-value=0.0 The relationship between Weather Normalized Site Electricity (kWh) and the Indirect GHG Emissions (Metric Tons CO2e) is statistically significant. Correlation Test for DOF Property Floor Area (ft²) vs Indirect GHG Emissions (Metric Tons CO2e): Pearson: Correlation=0.7665161539874037, p-value=0.0 The relationship between DOF Property Floor Area (ft²) and the Indirect GHG Emissions (Metric Tons CO2e) is statistically significant. Correlation Test for Water Use (All Water Sources) (kgal) vs Indirect GHG Emissions (Metric Tons CO2e): Pearson: Correlation=0.42686714753730404, p-value=3.375328044704652e-262 The relationship between Water Use (All Water Sources) (kgal) and the Indirect GHG Emissions (Metric Tons CO2e) is statistically significant.
Pearson Correlation Test: ENERGY STAR Score vs Site EUI¶
correlation_data = filtered_df[['ENERGY STAR Score', 'Site EUI (kBtu/ft²)']].dropna()
correlation, p_value = pearsonr(correlation_data['ENERGY STAR Score'], correlation_data['Site EUI (kBtu/ft²)'])
print("Pearson Correlation Test:")
print(f"Correlation coefficient: {correlation:.4f}, p-value: {p_value:.4f}")
if p_value < 0.05:
print("Reject the null hypothesis: There is a significant correlation between ENERGY STAR Score and Site EUI.")
else:
print("Fail to reject the null hypothesis: No significant correlation between ENERGY STAR Score and Site EUI.")
Pearson Correlation Test: Correlation coefficient: -0.7414, p-value: 0.0000 Reject the null hypothesis: There is a significant correlation between ENERGY STAR Score and Site EUI.
Chi-Square Test: Primary Property Type vs Borough¶
from scipy.stats import chi2_contingency
pivot_table = filtered_df1.pivot_table(index='Primary Property Type - Self Selected',
columns='Borough',
aggfunc='size',
fill_value=0)
plt.figure(figsize=(15, 8))
sns.heatmap(pivot_table, annot=True, fmt='d', cmap='YlGnBu')
plt.title('Heatmap of Property Types by Borough', fontsize=16)
plt.xlabel('Borough', fontsize=14)
plt.ylabel('Primary Property Type', fontsize=14)
plt.tight_layout()
plt.show()
Multifamily Housing is the most common property type in Manhattan, with 2714 buildings, reflecting the high population density and residential demand in the borough.
contingency_table = pd.crosstab(filtered_df['Primary Property Type - Self Selected'], filtered_df['Borough'])
chi2, p_value, dof, expected = chi2_contingency(contingency_table)
print("Chi-Square Test:")
print(f"Chi-Square Statistic: {chi2:.4f}, p-value: {p_value:.4f}, Degrees of Freedom: {dof}")
if p_value < 0.05:
print("Reject the null hypothesis: There is an association between Primary Property Type and Borough.")
else:
print("Fail to reject the null hypothesis: No association between Primary Property Type and Borough.")
Chi-Square Test: Chi-Square Statistic: 2249.3423, p-value: 0.0000, Degrees of Freedom: 204 Reject the null hypothesis: There is an association between Primary Property Type and Borough.
Insights¶
Energy Usage and Efficiency¶
Energy Efficiency by Borough: Manhattan displays higher energy efficiency, possibly due to modernized infrastructure.
Weather-Normalized EUI: Weather-normalized Site EUI shows less variability, highlighting standardization benefits in energy reporting.
Primary Property Type Impact: Commercial properties have higher Site EUIs compared to residential properties, indicating higher energy demands.
Built year: Older buildings before 1950 had lower energy intensity. Buildings from 1950-2000 have higher energy use, likely due to inefficient systems and increased energy demand. Post-2000 improvements showcase a strong impact of modern energy-saving measures, sustainable designs, and policies.
GHG Emissions¶
GHG Emissions Trends: Indirect GHG emissions contribute significantly compared to direct emissions, suggesting electricity reliance.
Borough GHG Variance: Manhattan and Brooklyn have higher total GHG emissions due to higher property density.
Building Age Analysis¶
Old vs. New Buildings: Older buildings exhibit higher energy and water usage, showing inefficiencies in older technologies.
Modern Properties: Newer buildings have lower GHG emissions, aligning with stricter building codes and energy-efficient designs.
Water and Gas Usage¶
Water Usage: Water consumption trends align with property size and type, with larger properties showing higher water use.
Natural Gas Usage: Larger commercial buildings exhibit a spike in natural gas usage.
Correlations and Trends¶
Correlation Between Variables: Energy intensity correlates positively with GHG emissions but shows weak correlation with property floor area.
Energy Star Score: Buildings with higher Energy Star scores show significantly reduced emissions, validating score relevance.
Grouped Comparisons¶
Borough-Property Type Pairing: Combining Borough and Primary Property Type reveals hotspots for energy inefficiency, especially in older commercial buildings.
Built Year Grouping: Older buildings grouped by built year consistently show higher EUI and emissions metrics.
Hypothesis Testing¶
Kruskal-Wallis Test: Significant differences in GHG emissions among boroughs suggest location-based energy policies might be effective.
Comparison by Property Type: Hypothesis testing confirms energy performance varies significantly by property type, emphasizing tailored efficiency programs.